Resample and composite engine for real-time volume rendering

ABSTRACT

The present invention is a digital electronic system for rendering a volume image in real time. The system accelerates the processing of voxels through early ray termination and space leaping techniques in the projection guided ray casting of the voxels. Predictable and regular voxel access from high-speed internal memory further accelerates the volume rendering. Through the acceleration techniques and devices of the present invention, real-time rendering of parallel and perspective views, including those for stereoscopic viewing, is achieved.

FIELD OF THE INVENTION

The present invention is a system for providing three-dimensional computer graphics. More particularly, the present invention is a system that accelerates the processing of volume data for real-time ray casting of a three-dimensional image and a method thereof.

BACKGROUND OF THE INVENTION

Volume rendering projects a volume dataset onto a two-dimensional (2D) image plane or frame-buffer. Volume rendering can be used to view and analyze three-dimensional (3D) data from various disciplines, such as biomedicine, geo-physics, computational fluid dynamics, finite element models and computerized chemistry. Volume rendering is also useful in the application of 3D graphics, such as Virtual Reality (VR), Computer Aided Design (CAD), computer games, computer graphics special effects and the like. The various applications, however, may use a variety of terms, such as 3D datasets, 3D images, volume images, stacks of 2D images and the like, to describe volume datasets.

As schematically depicted in FIG. 1, a volume dataset is typically organized as a 3D array of samples which are often referred to as volume elements or voxels. The volume dataset can vary in size, for example from 128³ to 1024³ samples, and may also be non-symmetric, e.g., 512×512×128. The samples or voxels can also vary in size. For example, a voxel can be any useful number of bits, for instance 8 bits, 16 bits, 24 bits, 32 bits or larger, and the like.

The volume dataset can be thought of as planes of voxels or slices. Each slice is composed of rows or columns of voxels or beams. As depicted in FIG. 1, the voxels are uniform in size and regularly spaced on a rectilinear grid. Volume datasets can also be classified into non-rectilinear grids, for example curvilinear grids. These other types of grids can be mapped onto regular grids.

Voxels may also represent various physical characteristics, such as density, temperature, velocity, pressure and color. Measurements, such as area and volume, can be extracted from the volume datasets. A volume dataset may often contain more than a hundred million voxels, thereby requiring a large amount of storage. Because of the vast amount of information contained in a dataset, interactive volume rendering or real-time volume rendering defined below requires a large amount of memory bandwidth and computational throughput. These requirements often exceed the performance provided by typical modern workstations and personal computers.

Volume rendering techniques include direct and indirect volume rendering. Direct volume rendering projects the entire dataset onto an image-plane or frame buffer. Indirect volume rendering extracts surfaces from the dataset in an intermediate step, and these projected surfaces are approximated by triangles and rendered using the conventional graphics hardware. Indirect volume rendering, however, only allows a viewer to observe a limited number of values in the dataset (typically 1-2) as compared to all of the data values contained therein for direct volume rendering.

Direct volume rendering that is implemented in software, however, is typically very slow because of the vast amount of data to be processed. Moreover, real-time direct (interactive) volume rendering (RTDVR) involves rendering the entire dataset at over 10 Hz; however, 30 Hz or higher is desirable. Recently, RTDVR architectures have become available for the personal computer, such as VolumePro, which is commercially available from RTVIZ, a subsidiary of Mitsubishi Electronic Research Laboratory. VIZARD II and VG-Engine are two other RTDVR accelerators that are anticipated to be commercially available. These accelerators may lower the cost of interactive RTDVR and increase performance over previous non-custom solutions. Moreover, they are designed for use in personal computers. Previous solutions for real-time volume rendering used multi-processor, massively parallel computers or texture mapping hardware. These solutions are typically expensive and not widely available due to, for instance, the requirement for parallel computers. Alternatively, these solutions generate lower quality images by using texture-mapping techniques.

Although accelerators have increased the availability and performance of volume rendering, a truly general-purpose RTDVR accelerator has yet to emerge. Current accelerators generally support parallel projections and have little or no support for perspective projections and stereoscopic rendering. These different projections are illustrated in FIG. 2. Stereoscopic rendering is a special case where two images, generally two perspective images, are generated to approximate the view from each eye of an observer. Stereoscopic rendering typically doubles the amount of data to be processed to render a stereoscopic image. Moreover, current accelerators also require high memory bandwidths that can often exceed 1 Gbyte per second for a 256³ dataset.
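By way of a non-limiting illustration, the order of magnitude of the bandwidth figure above can be checked with simple arithmetic. The sketch below assumes 16-bit voxels and a 30 Hz frame rate, both of which are illustrative choices, and assumes every voxel is read exactly once per frame; the function name is hypothetical.

```python
def required_bandwidth_bytes_per_s(dim, bytes_per_voxel, frames_per_s):
    """Bytes per second needed to read every voxel of a dim^3 dataset once per frame."""
    return dim ** 3 * bytes_per_voxel * frames_per_s


# Assumed example: 256^3 dataset, 16-bit voxels, 30 frames per second.
mono = required_bandwidth_bytes_per_s(256, 2, 30)

# A stereoscopic pair processes the data twice (one pass per eye).
stereo = 2 * mono

print(mono, stereo)
```

Under these assumptions the monoscopic case already requires roughly 1 Gbyte per second, and the stereoscopic case roughly twice that.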

Furthermore, these current accelerators are typically either image-order or object-order architectures. An image-order architecture is characterized by a regular stepping through image space and the object-order architecture is characterized by a regular stepping through object space. Image-order ray casting architectures may support algorithmic speed-ups, such as space leaping and early ray termination, and perspective projections. Object-order architectures tend to provide more hardware acceleration and increased scalability. Object-order architectures, however, have not generally provided algorithmic acceleration. The trade-off between these various limitations is typically either (i) good parallel rendering performance and no support for perspective projections or (ii) good algorithmic acceleration and little hardware acceleration, or vice versa.

The voxel-to-pipeline topologies of typical image-order and object-order accelerators are shown schematically in FIGS. 3 and 4, respectively. Image-order architectures must access several voxels from a volume memory per processor. This typically causes a bottleneck in achievable hardware acceleration and thereby limits the number of useful processors. For example, as illustrated in FIG. 3, a typical image-order architecture has an 8-to-1 bottleneck for each image-order pipeline. Although algorithmic acceleration for the reconstruction, classification, shading and the composition of the voxels can often increase performance, such an increase in performance is often outweighed by the voxel bottleneck in the memory system, thereby limiting the overall acceleration.

As depicted in FIG. 4, object-order pipelines generally require only one voxel access per processor, thereby providing greater hardware acceleration due to the lack of a voxel or a memory bottleneck. Object-order reconstruction of the dataset, however, makes it difficult, if not impossible, to implement algorithmic acceleration or support perspective projections.

Neither image-order nor object-order architectures are general-purpose techniques because of their limitations. For example, image-order architectures only deliver interactive performance for certain types of datasets by relying heavily on algorithmic acceleration. Performance can be extremely sensitive to viewing parameters (and dataset characteristics), potentially causing large fluctuations in performance. On the other hand, object-order architectures yield more consistent performance but typically do not support perspective projections. As a result, these architectures cannot be used for applications that require stereoscopic rendering, virtual reality, computer graphics, computer games and fly-throughs.

Thus, there is a need for a device capable of general-purpose volume rendering performance that supports interactive rendering for both parallel and perspective projections. Furthermore, there is a need for a general-purpose device that supports interactive rendering for stereoscopic displays.

SUMMARY OF THE INVENTION

The present invention is a general-purpose device that supports interactive rendering for parallel and perspective projections and stereoscopic rendering thereof. The general-purpose device is further characterized as a digital electronic system for real-time volume rendering of a 3D volume dataset. A new hybrid ray casting is used to volume render a real-time image from external memory. Volume rendering includes reconstruction, classification, shading and composition of subvolumes or voxels of a volume dataset representing the 3D image. Early ray termination and space leaping accelerate the processing of the voxels by dynamically reducing the number of voxels necessary to render the image. Furthermore, the underlying hardware of the present invention processes the remaining voxels in an efficient manner. This allows for real-time volume imaging for stereoscopic displays.

The hardware architecture of the present invention supports projection-guided ray casting, early ray termination and space leaping for improved memory usage. The hardware architecture further accelerates the volume rendering due, in part, to regular and predictable memory accessing, fully pipelined processing and space leaping and buffering of voxels to eliminate voxel-refetch.

The incorporation of the projection guided ray casting, including early ray termination and space leaping, and the hardware architecture permit rendering of the image where the rendering is not the critical time-consuming operation. In other words, the present invention can render many volumes in a faster time period than the entire volumes can be read from external memory.

Another aspect of the present invention includes a method for volume rendering an image where there is no substantial refetching of data from external memory. Perspective projections, under certain circumstances, may require a minimal, but non-limiting, refetching of some data. The method includes early ray termination and space leaping accelerations and the processing of voxels in a predictable manner in hardware to volume render an image in real-time.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 is a schematic depiction of a volume dataset for rendering an image.

FIG. 2 is an illustration of different projections useful for rendering an image.

FIG. 3 is a schematic illustration showing voxel-to-pipeline topology or processor of an image-order accelerator.

FIG. 4 is a schematic illustration showing voxel-to-pipeline topology or processor of an object-order accelerator.

FIG. 5 is a schematic illustration of the projection-guided ray casting of the present invention.

FIG. 6 is a conceptual illustration of the projection-guided ray casting of the present invention.

FIG. 7 is a schematic illustration of a frame buffer initialization of the present invention.

FIG. 8 is a schematic overview of the hardware architecture of the present invention.

FIG. 9 is a schematic depiction of data flow for processors of the hardware architecture of FIG. 8.

FIG. 10 is a schematic depiction of data flow for the controller of the hardware architecture of FIG. 8.

DETAILED DESCRIPTION OF THE INVENTION

The system of the present invention is a digital electronic system, including hardware architecture, for real-time volume rendering of a 3D volume dataset. The system of the present invention maximizes processing efficiency while retaining flexibility of ray casting by selecting image-forming voxels, such as non-transparent and non-occluded voxels, for further processing and minimizing the processing requirements or rejecting non-image-forming voxels, such as transparent or occluded voxels.

Desirably, the system of the present invention (1) sustains burst memory accesses to every voxel, (2) constantly accesses voxels from the memory system, (3) does not fetch voxels from the memory system more than once and (4) allows for early-ray termination and space leaping. Sustaining burst memory accesses to every voxel is accomplished, in part, by having each set of voxels being accessed in a regular manner based on the desired virtual viewing position. The number of voxels in the set is dictated by the minimum burst length required to hide the latency of the dynamic random access memory (DRAM) device. The constant access of voxels requires, in part, that the set of voxels be processed in a predictable order so that the correct voxels can be prefetched from memory. This allows fully pipelined rendering and eliminates delays or idle cycles in the hardware architecture. The elimination of refetching is achieved, in part, by having each voxel's contribution to the image-plane being determined when the voxel is accessed, thereby allowing the voxel to be discarded once it is processed. The last condition requires, in part, that rays be launched independently of each other.

The system of the present invention may be included into a personal computer or similar device. Such a device will also typically contain a screen for viewing the rendered graphic image, and typically contains memory.

As described in further detail herein, the present invention includes projection guided ray casting and hardware architecture for rendering real-time images. The projection guided ray casting further includes early ray termination and space leaping, which are discussed below in further detail.

Projection Guided Ray Casting (PGRC)

The hybrid ray casting of the present invention is described as Projection Guided Ray Casting (PGRC) because it successfully merges the benefits of the object- and image-order processing using hardware acceleration and sample processing acceleration. Required memory-bandwidth and computational-throughput for interactive volume rendering are reduced, making it possible to render a dataset faster than the entire dataset can be read from memory.

In traditional ray casting, rays are cast through each pixel on the image-plane. Samples inside of the volumetric dataset are reconstructed and rendered at evenly spaced intervals along each ray. Image-plane traversal is typically scanline-by-scanline, which gives rise to random memory access of the volume dataset and multiple voxel refetches which typically thrash the volume memory, resulting in poor hardware efficiency due to idle memory cycles. Although the overall efficiency of traditional ray casting may possibly be enhanced by algorithmic acceleration, the low hardware acceleration efficiency typically causes the rendering performance to be slower than the reading of the dataset from memory. These aspects of traditional ray casting typically limit its performance.

A schematic and a conceptual illustration of PGRC are shown in FIGS. 5 and 6, respectively. PGRC uses forward projections to enhance the memory performance of the ray casting. The dataset 30 is partitioned into hundreds or thousands of sub-volumes referred to as voxel access blocks 32. Ray casting is applied to rays that penetrate these voxel access blocks 32 when the voxel access blocks are accessed from memory. Since these voxel access blocks 32 are small, they project to a small portion of the image-plane. Only the small groups of rays that penetrate each voxel access block 32 are rendered. The PGRC iterates over each voxel access block 32 with a front-to-back processing thereof until the entire dataset 30 is processed. In PGRC virtually all voxel re-fetch is eliminated.

Forward projections that are used during PGRC may also be used during scan-conversion in traditional 3D polygon-based acceleration. Scan-conversion hardware is an integral part of personal computers and workstations. Using a view transformation matrix that maps from object-space to image-space, each vertex can be projected onto the image-plane. The polygon is filled with a color and/or texture (texture-mapping). In PGRC, these conventional scan-conversion computations along with a front-to-back processing of voxel access blocks 32 are used, in part, to eliminate memory thrashing in the ray-casting algorithm.

Referring to FIG. 5, at step 10 a view transformation matrix is computed based on the desired view or perspective. A frame buffer is initialized with the entry-point of each ray into the dataset 30. At step 12, a cubic set of voxels or the voxel access blocks 32 are selected and processed in front-to-back order. Voxel access blocks 32 are a “b×b×b” array of voxels as shown in FIG. 6. At step 14, the eight voxels at the corners of a voxel access block 32 are each projected onto the image plane 34 using the view transformation matrix or forward projectors 36, as depicted in FIG. 6, forming a 2D footprint 40 in image-space. At step 16, the pixel access blocks 42, which contain the forward projected image, are bounded to complete the creation of the 2D footprint 40 in image space.

At step 18, rays of backward projectors 38 are then cast through each pixel that lies on or within this 2D footprint 40. At step 20, the segment along each ray that penetrates the voxel access block 32 is computed. Upon exiting the voxel access block 32, the rays are written into a frame buffer 35. The new state (color, opacity, and position) of these rays is stored at step 22 as a pixel inside of the frame buffer 35. The above steps are repeated for each voxel access block 32 in front-to-back order until every voxel access block has been processed.
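By way of a non-limiting illustration, the forward projection of the eight corners of a voxel access block and the selection of the pixels through which rays are cast may be sketched as follows. The sketch assumes a 4×4 row-major view transformation matrix and approximates the 2D footprint by its image-space bounding box rather than by a scan-line fill; the function names are hypothetical.

```python
import math
from itertools import product


def project(view, point):
    """Map an object-space point to image-space with a 4x4 row-major matrix."""
    x, y, z = point
    h = [row[0] * x + row[1] * y + row[2] * z + row[3] for row in view]
    # Perspective divide; w equals 1 for a parallel projection.
    return h[0] / h[3], h[1] / h[3]


def block_corners(origin, b):
    """Return the eight corner positions of a b x b x b voxel access block."""
    ox, oy, oz = origin
    return [(ox + dx * b, oy + dy * b, oz + dz * b)
            for dx, dy, dz in product((0, 1), repeat=3)]


def footprint_bbox(view, origin, b):
    """Bounding box (xmin, ymin, xmax, ymax) of the projected block, in pixels."""
    pts = [project(view, c) for c in block_corners(origin, b)]
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    return (math.floor(min(xs)), math.floor(min(ys)),
            math.ceil(max(xs)), math.ceil(max(ys)))


def rays_in_footprint(bbox):
    """Pixels (one ray each) that lie on or within the bounding box."""
    x0, y0, x1, y1 = bbox
    return [(x, y) for y in range(y0, y1 + 1) for x in range(x0, x1 + 1)]


# Assumed example: identity view matrix, i.e. a parallel projection along z.
IDENTITY = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
bbox = footprint_bbox(IDENTITY, (8, 16, 0), 8)
rays = rays_in_footprint(bbox)
```

A bounding box may include a few pixels outside the true footprint; a ray cast through such a pixel is clipped against the block and contributes no samples.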

As depicted in FIG. 7, the initial intersection of each ray 48 with the dataset 30 is stored into frame buffer 35 along with its X, Y and Z increment vector. The opacity and color values are initialized to zero for the entire frame buffer.
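By way of a non-limiting illustration, the per-pixel ray state described above may be modeled as a small record. The field names and the dictionary-based frame buffer are assumptions made for this sketch only.

```python
from dataclasses import dataclass


@dataclass
class RayState:
    """State of one ray, stored in the frame-buffer pixel it penetrates."""
    position: tuple                    # entry point of the ray into the dataset
    increment: tuple                   # X, Y and Z increment vector
    color: tuple = (0.0, 0.0, 0.0)     # initialized to zero
    opacity: float = 0.0               # initialized to zero


def init_frame_buffer(entry_points, increment):
    """Build a frame buffer from a mapping of pixel -> ray entry point.

    A single shared increment vector is assumed here (parallel projection).
    """
    return {pixel: RayState(position=pos, increment=increment)
            for pixel, pos in entry_points.items()}


frame_buffer = init_frame_buffer({(0, 0): (0.0, 0.0, 0.0),
                                  (1, 0): (1.0, 0.0, 0.0)},
                                 increment=(0.0, 0.0, 1.0))
```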

Voxel access blocks are processed from front-to-back order to allow early-ray termination. Since front-to-back ordering depends on a particular view position and view direction, which are known prior to rendering, the next voxel access block is prefetched, allowing fully pipelined operation in hardware. The direction of projection can be determined from the viewing parameters. It is a vector pointing from the center of projection towards a viewer. The eight corner voxels of each voxel access block 32 are projected onto the image-plane 34. The resulting vertices are mapped into image-space using a view transformation matrix.
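By way of a non-limiting illustration, one way to obtain a front-to-back block order for a parallel projection is to traverse each axis in the direction in which the rays advance. The sketch below assumes a view direction vector that points away from the viewer into the dataset, which is a convention chosen for this example; the function name is hypothetical and the perspective case is not covered.

```python
def front_to_back_blocks(nblocks, view_dir):
    """List block indices (i, j, k) in a front-to-back order.

    nblocks  -- number of blocks along x, y and z
    view_dir -- direction in which rays advance (assumed convention)
    """
    def axis_order(n, d):
        # Traverse an axis in the direction the rays move along it.
        return range(n) if d >= 0 else range(n - 1, -1, -1)

    nx, ny, nz = nblocks
    return [(i, j, k)
            for k in axis_order(nz, view_dir[2])
            for j in axis_order(ny, view_dir[1])
            for i in axis_order(nx, view_dir[0])]


order = front_to_back_blocks((2, 1, 2), (1.0, 0.0, -1.0))
```

Because the order depends only on the viewing parameters, it is known before rendering begins, which is what permits the next block to be prefetched.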

The eight projected vertices form a convex region in image-space that is then filled using well-known scan-line algorithms. The filling process determines the pixels (i.e., rays) that lie within the 2D footprint 40 of the voxel access block. As a result, only the exact rays that are needed are cast.

As discussed above, ray casting is applied to each ray from the true 2D footprint of the voxel access block. In practice, however, clipping regions are projected onto the image-plane instead of the voxel access block boundaries. Clipping regions are a function of the front-to-back ordering and type of projection. Clipping regions represent portions of a projected voxel access block near a projected ray and these clipping regions are processed for image rendering. Clipping regions are both translated and enlarged so that the clipping region coincides with data in the internal buffers. The clipping regions are enlarged by one to handle reconstruction computations, such as interpolation and gradient computations, in the proximity of an intra-block space.

Each pixel in the frame-buffer contains the state of the ray that penetrates it. Using an increment vector and the sample location of the ray, a segment of the ray is rendered until it exits the voxel block's clipping region. For perspective projections, the clipping region closest to the viewer is accessed first.
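By way of a non-limiting illustration, advancing a ray through a clipping region may be sketched as follows. The sketch assumes an axis-aligned clipping region given by inclusive lower and exclusive upper bounds, and it only records sample positions rather than shading them; the function name is hypothetical.

```python
def march_segment(position, increment, clip_min, clip_max):
    """Step a ray through a clipping region.

    Returns the sample positions inside the region and the exit position,
    which is the new ray position to be written back to the frame buffer.
    """
    if not any(increment):
        raise ValueError("increment vector must be non-zero")

    def inside(p):
        return all(lo <= c < hi for c, lo, hi in zip(p, clip_min, clip_max))

    samples = []
    p = tuple(position)
    while inside(p):
        samples.append(p)
        p = tuple(c + d for c, d in zip(p, increment))
    return samples, p


samples, exit_pos = march_segment((0.0, 0.0, 0.0), (0.0, 0.0, 1.0),
                                  (0, 0, 0), (4, 4, 4))
```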

Early Ray Termination and Space Leaping

The PGRC algorithm directly supports early-ray termination and space leaping. Both of these are “algorithmic acceleration” techniques because they reduce the amount of computation for rendering an image. Conceptually, early-ray termination selects non-occluded voxels for further processing and rejects occluded voxels from further processing. The dataset is not tested over all samples the viewing parameter dictates supersample. Because of the fully pipelined design, voxel access block memory accesses are overlapped with the processing of another voxel access block; therefore, there is no performance benefit in completing a voxel-block early unless the voxel-block is supersampled. During supersampling, however, the memory system is delayed for a length of time proportional to the sample-to-voxel ratio. Early-ray termination reduces or eliminates these delays.
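By way of a non-limiting illustration, early-ray termination may be sketched with conventional front-to-back compositing. The scalar color, the opacity threshold of 0.95 and the function name are assumptions made for this example.

```python
def composite_front_to_back(samples, opacity_threshold=0.95):
    """Composite (color, alpha) samples given in front-to-back order.

    Returns the accumulated color, the accumulated opacity and the number
    of samples actually processed before the ray was terminated.
    """
    color, alpha, processed = 0.0, 0.0, 0
    for sample_color, sample_alpha in samples:
        if alpha >= opacity_threshold:
            break  # ray is effectively opaque: remaining samples are occluded
        weight = (1.0 - alpha) * sample_alpha
        color += weight * sample_color
        alpha += weight
        processed += 1
    return color, alpha, processed


# The second sample is fully opaque, so the third is never processed.
result = composite_front_to_back([(1.0, 0.5), (1.0, 1.0), (1.0, 0.5)])
```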

Using early-ray termination, every voxel access block inside of the dataset is accessed only once. Therefore, the peak performance is equal to the rate at which the entire dataset can be read from memory. Since one goal of the present invention is to render the dataset faster than it can be read from memory, a more aggressive data processing acceleration technique is used that allows the skipping of the memory access to entire voxel access blocks.

Space leaping can provide substantial acceleration for many datasets, especially medical datasets where the regions of interest are typically near the center of the volume and there is a lot of empty space. Space leaping skips, or leaps over, transparent regions and requires either explicit or implicit distance information. The dataset is preprocessed and the distance to the next non-transparent sample along a ray is stored for each voxel inside the dataset. Encoding a distance at each voxel requires added memory and preprocessing overhead. In the present invention the additional memory requirements are minimized or reduced. Distances are encoded for a group of voxels, thereby reducing the overall leaping distance, which lowers memory requirements while only slightly reducing the acceleration achievable through space leaping.

Using implicit distance information, regions inside of the dataset are flagged transparent or non-transparent. When a ray advances to a transparent region, the ray can be quickly incremented through the region, taking into consideration the orientation and size of the region. This method has advantages over explicitly storing distances. For example, this method uses much less memory, for instance a single bit per region. Moreover, preprocessing involves simply comparing each voxel inside of the region to a user-defined threshold and this can be computed on-the-fly. Desirably, implicit distance information is used to leap over empty regions.

The volume data is first rendered as described above. As the dataset is rendered, each voxel contained in a voxel access block is compared against a user-defined transparency threshold. If every voxel is below the threshold, then the voxel access block is flagged empty in a small binary lookup-table. This table is called an empty voxel access block table. After the first image is rendered, the table can be applied to subsequent images until the dataset or user-defined transparency threshold is altered. Desirably, the empty voxel access block table is checked before accessing a voxel-block from the volume memory. In order for a voxel access block to be skipped, the voxel access block and its 26 neighbors must be transparent. The 26 neighbors are required to be transparent because of the way voxels are buffered and the clipping regions are translated. If the entire neighborhood of voxels is empty, any ray in the clipping region can be incremented by a dimension, b, of the voxel access block, regardless of the direction of the increment vector. Thus, perspective projections are supported by the present invention. Furthermore, the time to process a voxel access block is reduced. One benefit of this acceleration is that the overhead of computing the empty voxel access block table is completely hidden by useful work.
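By way of a non-limiting illustration, the empty voxel access block table and the 26-neighbor test may be sketched as follows. The dictionary representation, the function names and the treatment of neighbors outside the dataset as empty are assumptions made for this example.

```python
from itertools import product


def build_empty_table(blocks, threshold):
    """Flag a block empty when every voxel in it is below the threshold.

    blocks -- mapping of block index (i, j, k) -> iterable of voxel values
    """
    return {index: all(v < threshold for v in voxels)
            for index, voxels in blocks.items()}


def can_skip(empty_table, index):
    """A block may be skipped only if it and its 26 neighbors are empty."""
    i, j, k = index
    for di, dj, dk in product((-1, 0, 1), repeat=3):
        neighbor = (i + di, j + dj, k + dk)
        # Assumption: blocks outside the dataset count as empty.
        if not empty_table.get(neighbor, True):
            return False
    return True


# Three blocks in a row; only the last one holds a non-transparent voxel.
blocks = {(0, 0, 0): [0, 0], (1, 0, 0): [0, 0], (2, 0, 0): [0, 5]}
table = build_empty_table(blocks, threshold=1)
```

In this example block (0, 0, 0) can be skipped, whereas block (1, 0, 0) cannot, because its neighbor (2, 0, 0) is not empty.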

Hardware Architecture

The hardware architecture of the present invention is called Resample and Composite Engine (RACE) and is a hardware engine for, among other things, accelerating real-time volume rendering of a graphic image by having image-forming voxels available for processing without having to refetch a substantial number of voxels from external memory, such as the memory contained within a personal computer. An overview of the hardware architecture is described below, followed by a description of the data flow for the processors and the controllers of the present invention.

An overview of this hardware architecture is shown schematically in FIG. 8. The hardware architecture 50 contains a control unit 52 and a plurality (p+1) of processors 54. Each processor 54 contains a rendering unit 56, a volume memory 58 and a pixel memory 60.

The control unit 52 implements, among other things, object-order projection to control memory accesses to the voxel and pixel memories. The rendering units 56 implement the image-order ray casting, voxel-buffering and clipping. The control unit 52 provides synchronization for each processor 54 and generates memory addresses for all transactions on both the voxel memory bus 62 and the pixel memory bus 64. The volume memory 58 stores the data volume. The pixel memory 60 stores the color and the current state of each ray during the ray casting process.

The RACE architecture partitions the dataset into thousands of subvolumes or voxel blocks. In multiprocessor RACE configurations, each subvolume is equally divided among each processor 54. As the voxels are streamed into the processors 54 from the volume memory 58, they are quickly distributed among processors using local communication. Each processor 54 has a dedicated connection to the volume memory 58. Voxels from other processors are distributed using local neighbor-to-neighbor communication in a circular fashion.

With a “p+1” number of processors 54 in the system, after p+1 clock cycles, each processor 54 contains a local copy of the voxel-block. This allows fast random interpolation from high-speed internal SRAM memories. This is important for supersampling and for discrete ray-tracing architectures. Central differences at grid-points are computed on this fixed stream of voxels and stored into a gradient buffer. Alternately, voxels can be stored in a quad-ported SRAM, allowing gradients to be computed directly from adjacent samples. This alternate method, however, requires more memory addresses to be generated. The size of the buffer-memory is proportional to the resolution of the voxel-block. Because each voxel gets forwarded to other processors, memory partitioning is not critical and low-order interleaving to distribute the volume may be used. Interleaving allows accesses for each memory module to share a single memory address. Voxel-blocks that have at least 8(p+1) voxels can be stored in contiguous memory locations or interleaved groups of eight voxels between internal memory banks to guarantee peak DRAM memory performance.
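By way of a non-limiting illustration, the circular neighbor-to-neighbor distribution may be simulated as follows. The sketch is a simplification that assumes one voxel per processor, with each processor reading its own voxel in the first cycle and thereafter receiving from its right neighbor; the function name is hypothetical.

```python
def ring_distribute(own_voxels):
    """Simulate circular forwarding among len(own_voxels) processors.

    Returns the per-processor buffers and the number of clock cycles needed
    until every processor holds a copy of every voxel.
    """
    n = len(own_voxels)
    buffers = [[v] for v in own_voxels]  # cycle 1: read from dedicated memory
    in_flight = list(own_voxels)         # voxel each processor forwards next
    cycles = 1
    while any(len(b) < n for b in buffers):
        # Processor i receives what its right neighbor (i + 1) last held.
        in_flight = [in_flight[(i + 1) % n] for i in range(n)]
        for i in range(n):
            buffers[i].append(in_flight[i])
        cycles += 1
    return buffers, cycles


# Four processors, i.e. p + 1 = 4.
buffers, cycles = ring_distribute(["a", "b", "c", "d"])
```

With four processors the simulation completes in four cycles, consistent with the p+1 cycle count stated above.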

The rendered image is written into the pixel-memory 60. Each pixel stores the color, opacity, position and increment vector for a ray that penetrates it. The depth of each pixel in the frame-buffer is approximately twice the depth of pixels used in modern polygon-based accelerators. Modern 3D polygon-based accelerators store color, alpha, z-buffer, and stencil information per pixel using anywhere from 6-8 bytes of data. In the context of volume rendering, doubling the depth of the frame-buffer is reasonable because memory capacity is dominated by the volume buffer. As an example, frame-buffer capacity is typically 4 MB to 16 MB whereas 3D datasets often require 32 MB to 1 GB of storage capacity. The current trend in medical and scientific visualization is higher resolution datasets that consistently require over 128 Mbytes of memory storage. In the present invention each pixel memory also responds to a single memory address using low-order image interleaving. The frame buffer is partitioned equally among processors. The least significant bits of the pixel position dictate which processor owns the pixel. Low-order interleaving enhances load balancing between processors because of spatial coherence.
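By way of a non-limiting illustration, low-order interleaving of the frame buffer may be sketched as follows. The sketch assumes a power-of-two number of processors and a linear pixel address of y × width + x; the exact bit assignment is not specified above, so this mapping and the function name are assumptions.

```python
from collections import Counter


def owner(x, y, width, n_proc):
    """Processor that owns pixel (x, y), taken from the low address bits."""
    if n_proc < 1 or n_proc & (n_proc - 1):
        raise ValueError("n_proc must be a power of two")
    return (y * width + x) & (n_proc - 1)


# A 4x4 pixel footprint in an 8-pixel-wide image, shared by 4 processors.
counts = Counter(owner(x, y, 8, 4) for y in range(4) for x in range(4))
```

In this example each of the four processors owns four of the sixteen pixels, which illustrates how neighboring pixels of a small footprint are spread evenly.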

Before rendering starts, the RACE frame buffer is initialized with a color, opacity and the ray's entry position into the volume dataset or at the front-clipping plane. For perspective projections, the increment vector per ray is stored into the frame buffer. A slope-per-ray is only stored for perspective projections. For parallel projections, a register inside of the processor stores the increment vector and is updated once per projection. During shading, 3D accelerators interpolate values across the face of polygons. Typically, a color intensity (Gouraud shading) or a normal (Phong shading) is interpolated. To initialize the frame buffer, the color components of a voxel are assigned to be the actual position of the voxel for use in the Gouraud shading model. For parallel projections, the three visible faces can be then rendered as polygons to initialize the frame-buffer. For perspective projections, the view position is subtracted from each position and normalized to determine the increment vector. Since these calculations are 2D and performed once per projection, they will not cause a bottleneck in the 3D volume rendering performance.
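By way of a non-limiting illustration, the per-ray increment vector for a perspective projection may be computed by subtracting the view position from the ray's entry position and normalizing the result. The optional step length and the function name are assumptions made for this example.

```python
import math


def increment_vector(entry, view_position, step=1.0):
    """Unit direction from the view position to the entry point, times step."""
    d = [e - v for e, v in zip(entry, view_position)]
    length = math.sqrt(sum(c * c for c in d))
    if length == 0.0:
        raise ValueError("entry point coincides with the view position")
    return tuple(step * c / length for c in d)


inc = increment_vector((3.0, 4.0, 0.0), (0.0, 0.0, 0.0))
```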

The controller 52 generates addresses for the volume memory 58 and pixel-memory 60. Addresses for the volume memory are determined by the front-to-back ordering of the voxel access blocks and this ordering is based on user-defined viewing parameters. The controller 52 stores the empty voxel access block table that allows skipping of transparent or undesired subvolumes. Before issuing a memory access for a voxel access block, the controller 52 first checks the empty voxel access block table to determine if the block and its 26 neighboring voxel access blocks are transparent. If so, the controller 52 advances to the next voxel access block in front-to-back order and repeats. If the voxel access block or any of its 26 neighbors is not empty, the controller 52 generates the appropriate memory addresses for the DRAM memory.

For each voxel access block, the controller 52 computes a corresponding clipping region based on the front-to-back ordering. The 2D footprint of each clipping-region is determined using the view transformation matrix. The view transformation matrix is applied to each corner of the clipping-region. A bounding box in image-space is computed based on minimum and maximum coordinates thereof or, alternatively, scan conversion can be used to compute a footprint. The footprint is rounded to pixel-block boundaries. The controller 52 issues a memory address for each pixel-block inside of the footprint. The frame buffer responds by delivering an array of pixels. These pixel-tiles can be stored in contiguous memory locations on a DRAM page or interleaved between memory banks such that they can be accessed at the peak speed of the memory system.
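By way of a non-limiting illustration, rounding a footprint to pixel-block boundaries and enumerating the pixel-blocks to be addressed may be sketched as follows. The sketch assumes a bounding box given in inclusive, non-negative integer pixel coordinates and square pixel-blocks; the function name and the tile size are hypothetical.

```python
def pixel_blocks_in_footprint(xmin, ymin, xmax, ymax, tile):
    """Addresses (tx, ty) of all tile x tile pixel-blocks touching the box."""
    # Integer division rounds the box outward to pixel-block boundaries.
    tx0, ty0 = xmin // tile, ymin // tile
    tx1, ty1 = xmax // tile, ymax // tile
    return [(tx, ty)
            for ty in range(ty0, ty1 + 1)
            for tx in range(tx0, tx1 + 1)]


# A footprint spanning pixels x = 5..9 and y = 5..6 with 4x4 pixel-blocks.
tiles = pixel_blocks_in_footprint(5, 5, 9, 6, tile=4)
```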

The processors 54 perform the image-order ray-casting algorithm, voxel-buffering, and clipping to the local clipping region and global view-frustum. Each voxel from the processor's dedicated volume memory 58 is streamed into internal buffers. Voxels 64 from other volume memory modules are streamed in from the right neighbor. The processor 54 also forwards voxels 64 to its left neighbor. The entire sub-volume is distributed to each processor 54 in a circular fashion using neighbor-to-neighbor communication. Therefore, each processor 54 receives “p” voxels per clock-cycle, i.e., one from its dedicated memory system and “p−1” from its right-neighbors. Conceptually, this is the same as connecting all memory modules to every processor; however, to limit the fan-out on the memory bus, voxels are forwarded to neighboring processors. This increases the pin-out of the application-specific integrated circuit (ASIC).

Each of the “p” voxels is written to appropriate internal slice or voxel block buffers inside the rendering unit. Voxels are buffered to eliminate duplicate accesses to the volume memory, and this allows for reconstruction near the gaps between voxel blocks. Two slices of voxels are buffered for interpolation and gradient computation in each of the advancing directions. The first slice is necessary to interpolate samples that lie in between adjacent subvolumes. The second slice is needed to interpolate samples on the advancing faces of the previous block. Also, a slice of central difference gradients is buffered. The volume-slice buffers will dominate on-chip storage.

Processor Data Flow

FIG. 9 is a schematic illustration of the data flow for the processors 54. Each processor 54 receives a stream of pixels (rays) 70 from the frame-buffer and queues them in an input queue 72. Each ray 70 entering the input queue 72 is stamped with a tag (pixel-block address) and offset (relative position inside of the pixel-block). Each 2D footprint is delimited by a start-of-footprint (SOF) and end-of-footprint (EOF) flag so that the processor 54 can match clipping-regions to rays (pixels). In addition, a space-leap (SL) flag is used to determine if the ray can skip over the clipping region without rendering. These stamps originate from the controller 52.

Rays read from the input queue 72 are loaded into a new ray register 74. The following fields in the ray register 74 are checked: EOF/SOF flags, opacity threshold, SL flag, and position. EOF/SOF flags are used to synchronize (or switch) clip-regions. The opacity threshold is used to prevent the rendering of occluded samples, i.e., early ray termination. Conversely, the SL flags prevent the rendering of transparent samples. The ray's position is examined to see if it lies within the active clip-region.

Rays that are not opaque, clipped, or skipped are sent to the accept queue 76 to be rendered; all other rays take a second path (or clip path). Along the clip-path, if the SL flag is set and the ray-position was not clipped, then the position is incremented (space-leaped) through the clip region. Then, these rays are written to the appropriate line inside of the pixel-cache.
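The routing decision between the accept path and the clip path can be summarized in a small Python sketch. The `Ray` record, the half-open clip-region test, and the opacity threshold of 0.95 are illustrative assumptions; the description does not fix these details.

```python
from dataclasses import dataclass

@dataclass
class Ray:
    position: tuple    # current (x, y, z) position along the ray
    opacity: float     # accumulated opacity so far
    space_leap: bool   # SL flag stamped by the controller

def inside(position, lo, hi):
    """True if the position lies within the half-open box [lo, hi)."""
    return all(l <= p < h for p, l, h in zip(position, lo, hi))

def route(ray, clip_lo, clip_hi, opacity_threshold=0.95):
    """Return "accept" for rays that must be rendered, otherwise "clip"."""
    if ray.opacity >= opacity_threshold:
        return "clip"   # occluded: early ray termination
    if not inside(ray.position, clip_lo, clip_hi):
        return "clip"   # outside the active clip-region
    if ray.space_leap:
        return "clip"   # transparent block: space-leaped, not rendered
    return "accept"
```

Only rays that pass all three checks consume cycles in the ray-casting pipeline.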

After exiting the accept queue 76, the least significant bits from the x-, y-, and z-ray positions are used to address the voxel and gradient buffers. The fractional components are used as weights for the trilinear interpolations. The color, opacity, position and increment vector proceed through the ray-casting pipeline. A ray interleaving unit 78 interleaves rays from the accept queue 76 onto the inputs of the image-order ray caster 77. Ray interleaving is used to eliminate data hazards due to possible feedback in the composition calculation. The ray interleave unit 78 ensures that two consecutive (or adjacent) samples along the same ray are at the output of the shader stage and the output of the composition stage. This guarantees that two samples along the same ray are blended together.

The rendered ray is added into the pixel-cache 82. No cache misses are possible on this path because each ray that is added to the accept queue 76 gets a reserved cache-line. A ray is not loaded into the accept queue 76 until a cache-line becomes available. Each write-access to a cache-line increments a counter for the corresponding cache-line, so that it can be determined when the cache-line (i.e., pixel-tile) is complete and ready to be written to the frame-buffer.
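The write-counter bookkeeping for a single cache-line can be sketched as follows. The class and method names are hypothetical, and the sketch models only the reservation, counting and retirement behavior described above, not the timing of the hardware.

```python
class CacheLine:
    """One pixel-cache line holding a fixed number of rays (a pixel-tile)."""

    def __init__(self, rays_per_line):
        self.slots = [None] * rays_per_line
        self.write_count = 0
        self.valid = False
        self.tag = None

    def reserve(self, tag):
        """Reserve the line for a pixel-block; fails while the line is in use."""
        if self.valid:
            return False
        self.valid, self.tag = True, tag
        return True

    def write(self, offset, ray):
        """Store a ray and return True once every slot has been written."""
        self.slots[offset] = ray
        self.write_count += 1
        return self.write_count == len(self.slots)

    def retire(self):
        """Hand the rays to the output queue and clear valid bit and counter."""
        rays = list(self.slots)
        self.slots = [None] * len(rays)
        self.write_count = 0
        self.valid = False
        self.tag = None
        return rays
```

Because the line is reserved before the ray enters the accept queue, a later write can never miss.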

Once complete, the entire cache line is serially added to an output queue 83. Then, the valid bit and write counter for the cache-line are cleared. Whenever the output queue 83 is not empty, the processor 54 sends a write-pending flag to the controller. When the pixel-bus becomes inactive, the controller issues a write acknowledge causing the pixel-block to be streamed from the output queue 83 onto the pixel-bus. In a multiple processor configuration, the controller must receive a pending flag from each processor before releasing the pixel-bus. For most of this analysis, the terms pixel and ray are completely interchangeable since only one ray penetrates a given pixel.

The voxel buffer logic is responsible for generating central difference gradients and storing voxels at the correct locations in the internal static-RAMs (SRAM). There are four types of buffer memories: voxel-block, block-slice, beam-slice and volume-slice. One set of buffers stores voxels and another set stores central differences at on-grid positions. Central differences are computed as the voxel-block is streamed into the processor. When accessing the buffers for interpolation, gradient buffers and voxel-block buffers respond to a single memory address. Each buffer is an eight-way interleaved SRAM to provide the necessary voxel values to reconstruct the sample value and each component of the gradient in parallel.
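For reference, a central difference gradient at an on-grid position can be written as a few lines of Python. The sketch assumes unit voxel spacing and a volume stored as nested lists indexed `vol[z][y][x]`, and it only handles interior voxels; the streaming computation in the processor is not modeled.

```python
def central_difference(vol, x, y, z):
    """Central difference gradient (gx, gy, gz) at interior voxel (x, y, z)."""
    gx = (vol[z][y][x + 1] - vol[z][y][x - 1]) / 2.0
    gy = (vol[z][y + 1][x] - vol[z][y - 1][x]) / 2.0
    gz = (vol[z + 1][y][x] - vol[z - 1][y][x]) / 2.0
    return (gx, gy, gz)
```

Each component needs one voxel on either side of the sample, which is why voxels from neighboring blocks must be buffered or fetched along block boundaries.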

Two voxel slices and one gradient slice are buffered in each advancing x, y, and z direction. These buffers are double-buffered to allow access to a previous slice and to update the next slice for subsequent voxel-blocks. Front-to-back ordering proceeds beam-by-beam then slice-by-slice. As a result, these slices will dominate on-chip storage requirements. In general, architectures that seek to eliminate voxel-refetch must buffer slices unless smaller reconstruction kernels are used for samples near a slice boundary.

To reduce memory, the slice of gradients can be eliminated by buffering a third slice of voxels and re-computing central differences for this particular slice. Desirably, the slice of gradients is buffered to simplify computation.

Various methods can be used to remove or reduce the size of the volume-slice buffer, including, but not limited to, storing the volume-slice memory in off-chip memory or pixel memory, rendering the dataset in sections and prebuffering. When the volume-slice memory is stored in the frame-buffer having a wide connection, the volume-slice buffer could be completely eliminated. In the RACE architecture, the pixel interface is wider than the voxel interface (e.g., 16 bytes). Therefore, these slices can be quickly loaded from the pixel memory. Each processor accesses the volume-slice from its dedicated pixel-memory.

To reduce the size of the volume-slice buffers, the dataset can be rendered in sections. The size of the volume-slice buffers is inversely proportional to the number of sections used. Voxels residing on or near a boundary of a section are re-fetched from the volume memory, slightly lowering performance. Any face of a voxel-block can potentially lie on the boundary of a section. As a result, the memory accesses to any of the six faces may cross DRAM-page boundaries due to the low-order interleaving scheme. Alternately, the voxel-block can be organized such that boundary block-slices can be retrieved conflict-free from any direction using a skewed memory organization.

Auxiliary voxel-buffers (beam-, block- and volume-slice) may be eliminated by accessing a voxel-block and boundary voxels from neighboring voxel-blocks each time the block is accessed. This method is a prebuffering method because the dataset can be reorganized during a quick preprocessing stage which combines each voxel-block with a surrounding shell of voxels inside of the memory (increasing the required memory capacity). This creates self-contained blocks that have all of the necessary information to reconstruct samples that lie in a (b+1)×(b+1)×(b+1) subvolume; however, the buffers must be (b+3)×(b+3)×(b+3) in size. Therefore, this method will lower performance by introducing some duplicate memory accesses to the volume memory, especially for small blocks. It has the advantage of simplifying internal buffering logic and reducing the number of separately addressable buffers from four to one for the interpolation and gradient memories. These buffers are internally eight-way interleaved.

Moreover, because of the block processing utilized by the RACE architecture, higher-order gradient filters can be used without incurring a performance penalty. Gradient encoding or lookup-table based gradients can also be incorporated into the architecture. The logic that converts the stream of voxels into central differences at on-grid locations can be replaced by lookup-tables containing gradient components.

After the gradient and interpolation computations, the interpolation value is used to index the classification tables for the red, green, blue and opacity transfer functions. Optionally, the gradient magnitude may be used to modulate the opacity function. This highlights surface boundaries and increases the transparency in homogeneous regions of the dataset. The gradient magnitude computation requires a computationally expensive square root operator. It can be approximated using the norm of the gradient vector or using iterative numerical methods.
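As one possible reading of the iterative approach, the following Python sketch refines the sum of absolute components (an inexpensive norm that never underestimates the magnitude) with a few Newton iterations for the square root. The starting guess and the iteration count are assumptions chosen for illustration; the description does not specify which norm or iteration is used.

```python
def gradient_magnitude(g, iterations=4):
    """Approximate |g| without a square root operator.

    Starts from the sum of absolute components and applies Newton's
    iteration x <- (x + s / x) / 2, where s is the squared magnitude.
    """
    s = sum(c * c for c in g)
    if s == 0:
        return 0.0
    x = float(sum(abs(c) for c in g))
    for _ in range(iterations):
        x = 0.5 * (x + s / x)
    return x
```

For a three-component gradient the starting guess is at most a factor of √3 too large, so a small fixed number of iterations is sufficient in this sketch.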

The pixel cache serves several purposes, including retiring two rays every clock cycle, i.e., one skipped (or clipped) and one rendered, synchronizing the pixel-blocks with the controller and completing out-of-order pixel-blocks.

Each ray entering into the RACE pipeline takes one of two paths: the accept path (path #1, for rendering) or the algorithmically skipped/clipped path (path #2, little/no processing). Path #1 processes ray segments that are not algorithmically eliminated and lie inside of the clipping-region; therefore, they must be rendered. Each of these rays is loaded into the accept queue 76.

Along the first path, all rays are rendered using the conventional ray-casting algorithm until they exit the clipping-region. Once they exit, rays are written to the current cache-line or the next sequential cache-line, i.e., pixel cache. No cache misses occur along this path because a cache-line is reserved before the ray enters path #1 and the cache-line is not discarded until all rays from the cache-line have been processed.

Path #2 handles two cases: the segment of the ray is algorithmically eliminated (skipped/occluded) or the ray's current xyz position is outside of the voxel-block's clipping region. Along Path #2, the Clip-and-Add Unit 80 increments the ray's position if the SL flag is set and the ray is inside of the current (space-leapable) clip-region. This adder increments the ray position by a distance of b in the ray's primary direction. This quickly advances the ray through an empty voxel-block. This allows the ray-position to be incremented by another ray-position that is exactly one voxel-block in the major viewing direction along the ray with a single increment. Also, by limiting the norm to be a power of two, each component of the increment vector is scaled using a shift-register.
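The shift-based space leap can be illustrated with fixed-point arithmetic in Python. The sketch assumes that positions and increments are integers with `FRAC_BITS` fractional bits, that the increment vector is normalized so its primary (largest) component has magnitude 1.0, and that the block size is b = 2^`log2_b`; these representation details are assumptions, not taken from the description.

```python
FRAC_BITS = 8  # fixed-point fractional bits (illustrative assumption)

def space_leap(position, increment, log2_b):
    """Advance a ray by b = 2**log2_b voxels along its primary direction.

    Because b is a power of two, scaling each increment component by b
    is a left shift, so the leap costs one shift and one add per axis.
    """
    return tuple(p + (d << log2_b) for p, d in zip(position, increment))
```

For example, with an increment of (1.0, 0.5, −0.25) voxels per step and b = 8, a single leap moves the ray by (8, 4, −2) voxels.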

After exiting the clip-and-add circuitry 80, rays are written to the pixel cache 82. If a cache-hit occurs on the current cache-line, the ray is written at the appropriate address in the cache line. The current cache-line is indicated by a pointer to the cache. This cache utilizes three pointers: two write pointers, one for Path #1 (render) and one for Path #2 (skip/clip), and a single read pointer. Data is read from the cache at the read pointer and loaded into the output queue 83. Each pointer increments sequentially through the cache.

The pixel cache 82 is direct mapped to a pointer that indexes the cache and not the pixel address. As a result, only one tag compare is necessary regardless of the size of the cache. No tag comparison is necessary for the read-port of the cache. The read port cycles through each cache-line, waiting for the write counter to expire before advancing.

If a cache-miss occurs on path #2, the clip pointer is incremented by one to the next cache-line. Cache misses can only occur for the first pixel inside of a pixel-block. If the next cache-line is marked valid, then the clip logic halts all registers from the Input Queue along the clip-path until the line becomes invalid. Once the line becomes available, the line is marked valid and the ray's tag is stored on the cache-line. Then, the ray's color, position and increment vector are written into the cache. Cache-lines are marked invalid after the full number of write operations has occurred to a single cache-line and the entire cache line has been transferred into the output queue 83. The pixel-block is not retired until the cache-line is indexed by the read pointer. Each ray on the cache-line is then transferred into the output queue 83.

In multiprocessor implementations, the pixel-blocks are evenly partitioned among the processors. The size of the cache-line and the termination write-count are inversely proportional to the number of RACE processors. A benefit of this dual-path approach is that two rays can complete in a single clock cycle. Furthermore, it allows the majority of the pixels that lie outside of the true footprint but within the bounding-box to be clipped without causing additional stalls in the image-order ray casting pipeline.

Because sequential pointers index the cache, pixels from the same pixel-block but residing in different processors are written to the same relative cache-line in the corresponding processor. The sequential read pointer guarantees that pixel-blocks are retired in the same order that they are reserved. This provides synchronization with the controller. As a result, the controller can resynchronize the pixel-blocks among multiple processors before they are written over the pixel-bus. The controller simply waits for each processor to generate a write pending signal. After a cache-line is transferred to the output queue 83, the read pointer is incremented to the next cache-line in a circular fashion.

If the output queue 83 is not empty, a flag is sent to the controller to indicate a write pending status. If the queue is full, a critical write-pending status flag is sent to the controller. Once the controller receives at least a write pending status from each processor and the pixel-bus is inactive, it sends a write acknowledge signal to each processor. In turn, the output queue 83 responds by placing pixels serially onto the pixel-bus in a first-in-first-out (FIFO) sequence.

Controller Data Flow

A dataflow for the RACE controller 52 is illustrated in FIG. 10. Front-to-back ordering generates a sequence of voxel-blocks to be accessed from the DRAM memory. These voxel-blocks can be accessed from memory using one or more volume memory addresses based on the size of the voxel-block, b, the DRAM page-size, and the DRAM burst size needed to hide latency. The controller 52 is responsible for setting up both read and write memory transfers to the pixel-memory. As the controller issues memory addresses to the frame-buffer, it records the history of the previous h memory addresses in a queue called the history queue 90. The maximum number of pixel-blocks that can be processed (or issued) at a given time is limited by the smaller of the history queue size and the number of pixel-blocks that can be stored in the internal buffers (queues and caches) inside of the RACE processor.

When the history table 92 becomes full, the controller 52 stops processing the footprint until a pixel-block is retired. The history queue 90 generates the correct write address when it is time to retire a pixel-block. The history table 92 prevents the accessing of pixels that are already rendered and is a random access copy of the pixel-block address. Each pixel-block entry in the table has a valid/invalid flag. Before any pixel-block is issued to the pixel-memory controller, the pixel-block address is checked to see if it is already being processed. If so, the RACE controller halts the pixel-block access until a pixel-block is retired. Note that this mechanism can potentially be used to re-issue the pixel-block internally inside of the RACE processor, enhancing performance. When the controller acknowledges a write request, one pixel-block entry is simultaneously retired from the history queue 90 and history table 92.
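The interplay of the history queue and history table can be modeled with a short Python sketch. The class name, the use of a `set` to stand in for the table's valid flags, and the `try_issue`/`retire` interface are assumptions made for illustration.

```python
from collections import deque

class History:
    """History queue (retirement order) plus history table (in-use lookup)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()   # pixel-block addresses in issue order
        self.table = set()     # addresses currently being processed

    def try_issue(self, address):
        """Issue a pixel-block unless the history is full or the block is in use."""
        if len(self.queue) >= self.capacity or address in self.table:
            return False       # the controller must halt until a block retires
        self.queue.append(address)
        self.table.add(address)
        return True

    def retire(self):
        """Retire the oldest pixel-block and return its write address."""
        address = self.queue.popleft()
        self.table.discard(address)
        return address
```

The queue supplies write addresses in the order the blocks were issued, while the table gives a constant-time check that prevents a stale read of a pixel-block that is still being rendered.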

The front-to-back generator is a simple three-digit counter that counts voxel-blocks. Voxel blocks are counted beam-by-beam then slice-by-slice until each block in the data volume has been visited.
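A three-digit block counter of this kind might be sketched in Python as follows. The rule that each digit counts up or down according to the sign of the corresponding view-direction component is an assumption added so that the order is front-to-back for an arbitrary viewpoint; the description itself only states that blocks are counted beam-by-beam and then slice-by-slice.

```python
def front_to_back_blocks(counts, view_dir):
    """Yield block indices (bx, by, bz) beam-by-beam, then slice-by-slice.

    counts:   number of voxel-blocks along x, y and z
    view_dir: viewing direction; its signs select the counting direction
    """
    def digit(n, d):
        # Count upward when the view looks along +axis, downward otherwise.
        return range(n) if d >= 0 else range(n - 1, -1, -1)

    for bz in digit(counts[2], view_dir[2]):          # slices
        for by in digit(counts[1], view_dir[1]):      # beams within a slice
            for bx in digit(counts[0], view_dir[0]):  # blocks within a beam
                yield (bx, by, bz)
```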

If a block is clipped, the block is discarded. As a result, the block does not consume any throughput on the voxel-bus or pixel-bus. If the block is not clipped, the 3D empty block table is checked to determine whether or not the current voxel-block and its 26 neighbors are transparent. If so, the block is flagged as empty. For synchronization purposes, the block is loaded in the volume memory access queue 94 and a DRAM memory access is not generated. Instead, the block's clipping region is forwarded to each processor and it is used to clip space-leaped rays. The empty block is also loaded into the footprint queue 96. Once the block reaches the head of the footprint queue 96, its clipping region is projected onto the image plane.

If the voxel-block is not tagged empty, it is issued to the volume memory controller 98 once it leaves the volume memory access queue 94. The controller waits until the previous voxel-block access is complete before issuing the next voxel-block.

As blocks exit the footprint queue 96, they are mapped from object-space (xyz) to image-space (uv) using the view transformation matrix. Once the u and v coordinates are computed for each corner of the voxel-block, the footprint of the voxel-block is computed in image-space. In conventional graphics accelerators, a precise scanline algorithm is used to compute the footprint (i.e., projected area) of primitives in image-space. Alternately, the RACE controller uses a simple bounding box approximation of the 2D footprint, thereby eliminating the need for scan-conversion hardware. Since each ray must be clipped against the current 3D voxel-block, the true 2D footprint is determined inside the processor. By proceeding center outwards, the controller quickly generates a workload for the RACE rendering pipelines by placing rays with longer paths into the queue first. This leads to less sensitivity to fluctuations on the pixel-bus and fewer wasted clock cycles in the pipeline.

The controller checks handshaking signals from the processors to determine whether or not each processor is ready to receive a pixel-block. This signal indicates the near-full state of the input queue 72. If any processor is not ready, the controller halts the projection unit until every processor is ready. In addition, the history table 92 is checked to determine if the pixel-block is currently in use by the RACE processors. The history table 92 records all of the pixel-blocks inside of the history queue 90. The history queue 90 keeps the correct ordering of pixel-blocks that are being rendered and provides necessary synchronization for write operations on the pixel-bus. Once each processor indicates a write-pending status, the controller issues a write acknowledge signal when the pixel-bus becomes available. The write request signal indicates that data resides in a processor's output queue 83. Each processor responds by placing pixels onto the pixel-bus. The combination of the history queue 90 and pixel cache 82 provides synchronization for write operations. The sequential read pointer that is used to index the pixel cache 82 guarantees that the pixel-blocks are retired in the same order they are read. Memory addresses from the history queue 90 are used to generate the write address for each pixel write operation. When an address is removed from the history queue 90, the entry is also cleared inside of the history table 92.

The controller 52 is also responsible for generating memory addresses for the frame buffer and the volume memory. Furthermore, the controller 52 keeps each engine operating in a fully pipelined manner.

The following example is provided to further illustrate the architectures and methods of the present invention for real-time volume rendering of images. The example is illustrative only and is not intended to limit the scope of the invention in any way.

EXAMPLES

Example 1

The resample and composite engine architecture was simulated in software using a C++ clock cycle simulator. The simulator conservatively assumed that the pixel memory bus operated at the same rate as the voxel memory bus and that the entire dataset lies within the view volume. In practice, embedded DRAM technology can be used for the relatively small pixel memory to enhance performance. Voxel-block sizes were varied between 64 (4³) and 32768 (32³) voxels. Pixel-tiles were sized to accommodate 16 pixels per processor. For example, if 4 processors are simulated, a pixel-tile containing 64 pixels is used. This allowed the Resample And Composite Engine to hide the memory latency when accessing the pixel-memory.

Each processor was configured as follows: the Input Queue could store up to 128 rays, the Accept Queue could store up to 16 rays, the Pixel Cache could store 128 rays, and the Output Queue could store up to 128 rays. The auxiliary on-chip storage required less than 10 KBytes of memory. Voxel buffers were double buffered and required either 256, 2K, 16K or 64K bytes of memory based on the block resolution, b. The internal slice-buffers dominated the on-chip storage and required 448 KBytes for a 256³ dataset.

The Resample And Composite Engine controller required less than 16 KBytes of on-chip storage for the Opaque Voxel Block (OVB) table, Transparent Voxel Block (TVB) table and internal buffers. An 8-entry pixel-address buffer was used to record the pixel-tiles that were being rendered by the resample and composite engine processors. This prevented the reading of stale data from the frame-buffer. The performance of the resample and composite engine architecture was simulated for six different datasets. The datasets were rendered using a plausible classification mapping. For example, CT datasets were rendered with a mapping of soft tissue to a semi-transparent value and bone to an opaque value. For each dataset, 26 (orthogonal, side and diagonal) view positions were used to estimate average rendering performance. The performance was then compared with the Data Access Rate (DAR), which is the peak rate at which the entire dataset can be read from the memory system. These results are presented in Table 1 below for a single resample and composite engine processor operating at 100 MHz. In this configuration, the resample and composite engine architecture used only $200\,\mathrm{MByte/second}$ of volume memory throughput.
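As a worked check, the Data Access Rate entries in Table 1 can be reproduced by dividing the voxel throughput by the number of voxels in the dataset. The Python sketch below assumes one voxel delivered per clock at 100 MHz, which is an inference from the tabulated values rather than a stated parameter; the function names are illustrative.

```python
def data_access_rate(dims, voxels_per_second=100e6):
    """Peak rate (Hz) at which the whole dataset can be read from memory."""
    x, y, z = dims
    return voxels_per_second / (x * y * z)

def memory_efficiency(frame_rate, dims, voxels_per_second=100e6):
    """Ratio of the achieved frame rate to the Data Access Rate."""
    return frame_rate / data_access_rate(dims, voxels_per_second)
```

Under this assumption a 64³ dataset gives about 381.47 Hz and a 256³ dataset about 5.96 Hz, matching the Data Access Rate row of Table 1. The 13.82 Hz reported for the 256³ MRI-head dataset with 8³ voxel-blocks then corresponds to a memory efficiency of roughly 2.3.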

From this table, the resample and composite engine architecture consistently outperformed the DAR rate for 8³ to 32³ voxel-blocks when the dataset was larger than 128³. In particular, 8³ to 16³ voxel-blocks delivered nearly a 75% increase in performance over the DAR rate, with peak performance exceeding 200% (i.e., 3.0 memory efficiency). For small voxel-blocks, the number of pixels per footprint can be greater than the number of voxels inside the voxel-block; therefore, the pixel bus can cause a bottleneck in performance.

A faster pixel interface allowed substantial gains in performance for small voxel-blocks (4³ to 8³) whose performance was limited by the pixel throughput. Because embedded DRAMs enable increased pixel memory throughput by a factor of 4 or more, this is a promising result. Each ray (or pixel) read from the frame buffer was also written; therefore, the read and write throughputs were identical. Small voxel-blocks consumed less than the full bandwidth of the volume memory bus because of algorithmically skipped blocks. This feature is exploited in shared memory accelerators, such as accelerated graphics port (AGP), when the dataset is rendered directly from main memory.

The pixel-bus was not limiting performance for larger voxel blocks. Furthermore, the sharing of pixel interfaces between two or more resample and composite engines can be potentially realized with only a small penalty in performance.

The memory efficiency of the resample and composite engine architecture generally increased with an increase in dataset resolution. Comparing the relative memory efficiency of a low resolution 64³ dataset and a higher resolution 256³ dataset revealed more than a 100% increase for 8³ voxel-blocks, as described in Table 1. This is because large datasets tended to have correspondingly larger regions of non-image-forming voxels. As a result, the average performance of a resample and composite engine architecture configured with 8³ to 16³ voxel-blocks is expected to exceed the DAR rate by a factor of 3 as dataset resolutions approach 512³. Colossal datasets will offer even more potential for acceleration benefits resulting from the present invention.

TABLE 1
Simulation Results for a Single Pipeline Operating at 100 MHz (all rates in Hz)

Dataset size      64³               128³              256 × 256 × 128   256 × 256 × 128   256³              256³
Dataset           Synthetic         MRI-head          CT-head           CT-engine         MRI-head          CT-head
Classification    High-opacity      High-opacity      (a)               Semi-transparent  High-opacity      (a)
Data Access Rate  381.47            47.68             11.92             11.92             5.96              5.96
4³ voxel-block    243.44 ± 106.70   44.34 ± 18.70     10.01 ± 4.89      7.50 ± 2.71       7.32 ± 3.14       3.39 ± 1.54
8³ voxel-block    403.08 ± 59.95    84.28 ± 16.71     19.27 ± 4.07      17.46 ± 2.69      13.82 ± 2.73      8.81 ± 1.62
16³ voxel-block   381.23 ± 0.28     66.20 ± 1.17      15.78 ± 0.55      16.40 ± 0.31      10.39 ± 0.34      9.33 ± 0.26
32³ voxel-block   381.46 ± 0.00     47.67 ± 0.02      12.81 ± 0.10      12.11 ± 0.10      6.41 ± 0.04       7.93 ± 0.04

(a) Bone high-opacity, tissue semitransparent.

A 256³ MRI dataset with multiple resample and composite engine processors for parallel and perspective projections was also simulated. As expected, perspective projections delivered less performance due to a slight increase in the amount of voxel refetch. By using 8³ to 16³ voxel-blocks, 20 Hz (15 Hz) performance was obtained for a 256³×16-bit dataset using only $400\,\mathrm{MByte/second}$ (i.e., two 100 MHz processors) of volume memory throughput and two resample and composite engines for parallel (perspective) projections. Extrapolating these results to a 512³ dataset, the resample and composite engine architecture requires only $3.2\,\mathrm{GByte/second}$ of volume memory throughput for similar frame rates. Larger algorithmic speedups are expected when the dataset resolution is increased. As a result, the resample and composite engine allows next generation size datasets to be rendered interactively using similar volume memory throughput that other solutions currently use to render smaller datasets. For example, texture mapping engines offer less than 10 Hz for 256³ datasets using more than $3.2\,\mathrm{GByte/second}$ of volume memory throughput. The VG-engine and VIZARD II approaches will require approximately $2\,\mathrm{GByte/second}$ bandwidth for similar performance on a smaller dataset. In the RACE architecture, 16³ voxel-blocks offer the best combination of scalability and performance when the pixel-bus and voxel-bus operate at the same clock frequency.

Various changes to the foregoing described and shown methods and corresponding structures would now be evident to those skilled in the art. The matter set forth in the foregoing description and accompanying figures is therefore offered by way of illustration only and not as a limitation. Accordingly, the particularly disclosed scope of the invention is set forth in the following claims.

1. A digital electronic system for real-time volume rendering of a 3Dvolume dataset comprising: a data-processing accelerator for reducing anumber of voxels for rendering an image in real-time by selectingimage-forming voxels that are non-transparent and non-occluded from aprojection and by rejecting non-image-forming voxels that aretransparent or occluded from the projection, wherein the voxels are avolume dataset of the image to be rendered contained in memory externalto the system; a control unit for forward projecting the 3D volumedataset at regularly spaced voxel positions to determine number of raysto be casted wherein said 3D volume dataset is divided into a pluralityof voxel access blocks having a cubic array of voxel; a processor forray casting the rays of the image-forming voxels in a front-to-backorder to form 2D representation of image planes; a hardware engine foraccelerating the real-time volume rendering by having the image-formingvoxels available for processing without having to refetch a substantialnumber of the voxels from the external memory; wherein the real-timeimage is rendered from the image-planes formed from the selected voxels.2. The system of claim 1 wherein the projection is a parallelprojection.
 3. The system of claim 1 wherein the projection is aperspective projection.
 4. The system of claim 1 wherein the projectionis a stereoscopic projection.
 5. The system of claim 1 wherein the raycasting includes early-ray termination and space leaping for selectingthe image-forming voxels, wherein the image-forming voxels arenon-occluded voxels and early-ray termination substantially avoidsoversampling of the occluded voxels.
 6. The system of claim 1 whereinray casting includes space leaping for selecting the image-formingvoxels, wherein the image-forming voxels are non-transparent voxels andspace leaping substantially avoids overprocessing of transparent voxels.7. The system of claim 1 wherein the hardware engine further comprisesvolume memory for storing a local copy of a small subset of the datavolume defining the voxels, a rendering unit for implementing the raycasting of the stored data volume and pixel memory for storing outputray data from the rendering unit from which the real time image is to berendered.
 8. The system of claim 1 wherein the hardware engine includesat least two processors and a controller synchronizes the processors. 9.The system of claim 8 wherein the data volume of neighboring voxels isdistributed between the at least two processors.
 10. The system of claim9 wherein data volume from one processor is distributed in a circularfashion to the other processor for interpolating image-cast rays. 11.The system of claim 10 wherein the volume memory is a high-speedinternal static or dynamic random access memory and each processor has adedicated connection the high-speed internal static or dynamic randomaccess memory.
 12. The system of claim 11 wherein the image can berendered from the hardware engine faster than all of the voxels in thevolume dataset can be read from the external memory.
 13. The system ofclaim 1 further comprising a personal computer containing the externalmemory.
 14. The system of claim 1 further comprising a screen forviewing the rendered real-time image.
 15. A method for rendering areal-time image comprising: retrieving a volume dataset from externalmemory; subdividing the volume dataset into a plurality of voxel accessblocks, wherein said voxel access blocks are a cubic array of voxels;storing the voxel access blocks in high-speed internal memory; forwardprojecting the voxels located at the corners of the block to determinenumber of rays to be casted, wherein said corner voxels correspond to aposition of said block; ray casting the rays in a front-to-back order toform a two-dimensional representation therefrom; reducing a number ofthe voxels for rendering an image in real-time by selectingnon-transparent voxels and non-occluded voxels and by rejectingtransparent voxels or occluded voxels wherein the voxels are the volumedataset of the image to be rendered contained in said external memory;processing the selected voxels to form pixels in a plurality ofprocessors having interleaved memories for processing and distributingthe voxels thereamong without having to refetch the voxels from theexternal memory; and rendering a real-time image therefrom.
 16. Themethod of claim 5 further including wherein the step of reducing thenumber of voxels further includes early-ray termination for selectingthe non-occluded voxels to substantially avoid oversampling of occludedrays.
 17. The method of claim 16 wherein the step of reducing the numberof voxels further includes space-leaping to substantially avoid theoverprocessing of the transparent voxels.
 18. The method of claim 16further including processing the pixels and the voxels in high-speedinternal random access memory to render the image therefrom faster thanthe step of retrieving the volume data set from the external memory. 19.A method for rendering a real-time image comprising: retrieving a volumedataset from external memory; forward projecting the volume dataset atregularly spaced voxel positions to compute number of rays/pixels to becasted, wherein the dataset is divided into plurality of voxel accessblocks having cubic array of voxels; ray casting the rays/pixels infront-to-back order visiting all voxel access blocks except fortransparent or occluded blocks without having to refetch the voxels fromthe external memory to form a 2D representation of image planes, whereinsaid image planes is a calculation of color, opacity and position of therays/pixels.
 20. A method for rendering a real-time image comprising: retrieving a volume dataset from external memory; subdividing the volume dataset into a plurality of voxel access blocks; storing the voxel access blocks in high-speed internal memory; forward projecting the voxels located at the corners of the block to determine number of rays to be casted, wherein said corner voxels correspond to a position of said block; ray casting the rays in a front-to-back order to form a two-dimensional representation therefrom; reducing a number of the voxels for rendering an image in real-time by selecting non-transparent voxels and non-occluded voxels and by rejecting transparent voxels or occluded voxels wherein the voxels are the volume dataset of the image to be rendered contained in said external memory; processing the selected voxels to form pixels in a plurality of processors having interleaved memories for processing and distributing the voxels thereamong without having to refetch the voxels from the external memory; and rendering a real-time image therefrom.
 21. A system for rendering a volume dataset, wherein the volume dataset includes a plurality of voxel blocks, wherein each of said voxel blocks includes two or more voxels, the system comprising: one or more rendering units; a first memory configured to store said plurality of voxel blocks; a control unit, wherein, for each of said plurality of voxel blocks, said control unit is configured to: identify, by performing a forward projection, a portion of a frame buffer corresponding to the voxel block; determine whether the voxel block is selected for transfer from said first memory to said one or more rendering units, wherein said determination is based upon whether said voxel block is transparent and whether said voxel block is occluded relative to a current viewing position; and transfer the voxel block from the first memory to said one or more rendering units in response to said determination indicating that the voxel block is selected for transfer; wherein, for each voxel block, said one or more rendering units are configured to process, in front-to-back order, a set of rays passing through the corresponding portion of the frame buffer, and wherein said one or more rendering units are configured to terminate processing of rays determined to be occluded.
 22. The system of claim 21, wherein the control unit is configured to perform said identification according to a front to back ordering of the voxel blocks.
 23. The system of claim 21, wherein said performing the forward projection is based on a parallel projection, a perspective projection, or a stereoscopic projection.
 24. The system of claim 21, wherein a first of the one or more rendering units is configured to determine whether a ray is occluded by comparing an opacity value of the ray to an opacity threshold.
 25. The system of claim 21, wherein a first of the one or more rendering units is configured to perform space leaping on at least one of the rays of the set of rays in response to an indication that a current one of the voxel blocks and voxel blocks neighboring the current voxel block are transparent.
 26. The system of claim 21, wherein the first memory comprises one or more volume memories coupled respectively to the one or more rendering units, wherein the plurality of voxels are partitioned among the one or more volume memories.
 27. The system of claim 26, wherein each of the voxel blocks is partitioned among the one or more volume memories.
 28. The system of claim 27, wherein each of the one or more rendering units is configured for circular distribution of voxels among the one or more rendering units.
 29. The system of claim 21, wherein the frame buffer is partitioned among one or more pixel memories coupled respectively to the one or more rendering units.
 30. The system of claim 29, wherein the control unit is further configured to transfer blocks of rays between the frame buffer and the one or more rendering units.
 31. The system of claim 30, wherein the rays of each block of rays is distributed among the one or more pixel memories so that each of the one or more rendering units processes a corresponding portion of the rays in each block of rays.
 32. The system of claim 21, wherein the one or more rendering units are configured to interpolate samples along the rays of said set of rays based on voxels of the transferred voxel block.
 33. The system of claim 21, wherein a first of the one or more rendering units is configured to compute gradients from voxels of the transferred voxel block.
 34. The system of claim 21 further comprising a personal computer containing the first memory.
 35. The system of claim 21 further comprising a screen for viewing an image stored in the frame buffer.
 36. The system of claim 21, where the frame buffer represents a rendered image of the volume dataset.
 37. The system of claim 21, wherein, for each of the voxel blocks, the control unit is configured to issue blocks of rays to the one or more rendering units starting from a center of said portion of the frame buffer.
 38. The system of claim 21, wherein a first of the one or more rendering units includes a ray caster unit, wherein the ray caster unit is configured to operate on rays by performing calculations including one or more of the following types of calculations: reconstruction, classification, shading, composition.
 39. The system of claim 38, wherein the ray caster unit is configured to perform composition calculations, and wherein the first rendering unit further includes a ray interleave unit configured to interleave rays of said set of rays in order to prevent feedback in said composition calculations performed in the ray caster unit.
 40. The system of claim 21, wherein the volume dataset is a computed tomography (CT) dataset or a magnetic resonance imaging (MRI) dataset.
 41. The system of claim 21, wherein the volume dataset represents geophysical information.
 42. The system of claim 21, wherein the volume dataset describes one or more properties of a fluid or of a chemical system.
 43. The system of claim 21, wherein the system is a 3D graphics system.
 44. The system of claim 21, wherein the system is a computer aided design (CAD) system.
 45. The system of claim 21, wherein said determination includes determining that the voxel block is not selected for transfer based on information indicating that the voxel block is occluded relative to the current viewing position.
 46. The system of claim 21, wherein said determination includes determining that the voxel block is selected for transfer based on information indicating that the voxel block is not occluded relative to the current viewing position and information indicating that the voxel block is not transparent.
 47. The system of claim 21, wherein said determination includes determining that the voxel block is selected for transfer based on information indicating that said voxel block is transparent, information indicating that the voxel block is not occluded relative to a current viewing position, and information indicating that neighboring voxel blocks of said voxel block are transparent.
 48. A system for rendering a volume dataset, wherein the volume dataset includes a plurality of voxel blocks, wherein each of said voxel blocks includes two or more voxels, the system comprising: one or more rendering means for performing rendering computations; a first means for storing said plurality of voxel blocks; a control means for: identifying, by performing a forward projection, a portion of a frame buffer corresponding to each of the voxel blocks; determining whether the voxel block is selected for transfer from said first means to said one or more rendering means, wherein said determination is based upon whether said voxel block is transparent and whether said voxel block is occluded relative to a current viewing position; and transferring the voxel block from the first means to said one or more rendering means in response to said determination indicating that the voxel block is selected for transfer; wherein said one or more rendering means comprise means for: processing, in a front-to-back order, a set of rays passing through the portion of the frame buffer, and terminating the processing of rays determined to be occluded.
 49. The system of claim 48, wherein a first of said one or more rendering means includes a first buffer for buffering two slices of voxels.
 50. The system of claim 49, wherein the first rendering means includes a second buffer for buffering one slice of gradient data.
 51. The system of claim 48, where the frame buffer is configured to store data representing a two-dimensional array of pixels, wherein each pixel defines a corresponding ray relative to the viewing position, wherein the stored data for each pixel includes a color, an opacity and a position.
 52. The system of claim 51, wherein the stored data for each pixel also includes an increment vector.
 53. The system of claim 48, wherein said determination includes determining that the voxel block is not selected for transfer based on information indicating that the voxel block is occluded relative to the current viewing position.
 54. The system of claim 48, wherein said determination includes determining that the voxel block is selected for transfer based on information indicating that the voxel block is not occluded relative to the current viewing position and information indicating that the voxel block is not transparent.
 55. The system of claim 48, wherein said determination includes determining that the voxel block is selected for transfer based on: information indicating that said voxel block is transparent, information indicating that the voxel block is not occluded relative to a current viewing position, and information indicating that neighboring voxel blocks of said voxel block are transparent.
 56. A method for rendering a volume dataset, wherein the volume dataset includes a plurality of voxel blocks, wherein each of said voxel blocks includes two or more voxels, the method comprising: a computer system storing the plurality of voxels in a first memory; for each of the voxel blocks: the computer system identifying, by performing a forward projection, a portion of a frame buffer corresponding to the voxel block; the computer system determining whether the voxel block is selected for retrieval from said first memory, wherein said determining is based upon whether said voxel block is transparent and whether said voxel block is occluded relative to a current viewing position; and the computer system retrieving the voxel block from the first memory in response to said determination indicating that the voxel block is selected for retrieval; processing, in front-to-back order, a set of rays passing through the corresponding portion of the frame buffer; and the computer system terminating processing of rays determined to be occluded.
 57. The method of claim 56, wherein each of the voxel blocks is retrieved from the first memory at most once per frame.
 58. The method of claim 56, wherein said identifying the portion of a frame buffer corresponding to each of said voxel blocks is performed according to a front-to-back ordering of the voxel blocks.
 59. The method of claim 56 further comprising: displaying an image from the frame buffer.
 60. The method of claim 56 further comprising: determining that a ray is occluded by comparing an opacity value of the ray to an opacity threshold.
 61. The method of claim 56 further comprising: performing space leaping on at least one of the rays of said set of rays in response to a determination that the voxel block and a plurality of neighboring voxel blocks are transparent.
 62. The method of claim 56, wherein said determining includes determining that the voxel block is not selected for retrieval based on information indicating that the voxel block is occluded relative to the current viewing position.
 63. The method of claim 56, wherein said determining includes determining that the voxel block is selected for retrieval based on information indicating that the voxel block is not occluded relative to the current viewing position and information indicating that the voxel block is not transparent.
 64. The method of claim 56, wherein said determining includes determining that the voxel block is selected for retrieval based on information indicating that said voxel block is transparent, information indicating that the voxel block is not occluded relative to a current viewing position, and information indicating that neighboring voxel blocks of said voxel block are transparent.
 65. A volume rendering controller configured to: access stored information to determine whether a block of voxels is selected for retrieval from a memory, wherein said stored information includes at least information specifying whether said block is transparent and information specifying whether said block is occluded relative to a current viewing position; determine, by performing a forward projection, a portion of a frame buffer corresponding to the block; output a clipping region of the block; control a transfer of the block from the memory onto a first bus in response to a determination that the block is selected for retrieval.
 66. The volume rendering controller of claim 65 further configured to: control a transfer of pixel tiles in the corresponding portion of the frame buffer onto a second bus.
 67. The volume rendering controller of claim 65 further configured to: generate a space-leap flag for the block based on an examination of said information, wherein the space-leap flag indicates whether space-leaping is to be performed on one or more rays associated with said portion of the frame buffer; and output the space leaping flag for the block.
 68. The volume rendering controller of claim 65, wherein the volume rendering controller is further configured to determine that the block is not selected for retrieval based on the information indicating that the block is occluded relative to the current viewing position.
 69. The volume rendering controller of claim 65, wherein the volume rendering controller is further configured to determine that the block is selected for retrieval based on the information indicating that the block is not occluded relative to the current viewing position and the information indicating that the block is not transparent.
 70. The volume rendering controller of claim 65, wherein the volume rendering controller is further configured to determine that the block is selected for retrieval based on: the information indicating that said block is transparent, the information indicating that the block is not occluded relative to a current viewing position, and additional information indicating that blocks of voxels neighboring said block are transparent.
 71. A method comprising: accessing stored information to determine whether a block of voxels is selected for retrieval from a memory, wherein said stored information includes at least information specifying whether said block is transparent and information specifying whether said block is occluded relative to a current viewing position; determining, by performing a forward projection, a portion of a frame buffer corresponding to the block; outputting a clipping region of the block; controlling a transfer of the block from the memory onto a first bus in response to a determination that the block is selected for retrieval.
 72. The method of claim 71 further comprising: controlling a transfer of pixel tiles in the corresponding portion of the frame buffer onto a second bus.
 73. The method of claim 71 further comprising: generating a space-leap flag for the block based on an examination of said information, wherein the space-leap flag indicates whether space-leaping is to be performed on one or more rays associated with said portion of the frame buffer; and outputting the space leaping flag for the block.
 74. The method of claim 71 further comprising: determining that the block is not selected for retrieval based on the information indicating that the block is occluded relative to the current viewing position.
 75. The method of claim 71 further comprising: determining that the block is selected for retrieval based on the information indicating that the block is not occluded relative to the current viewing position and the information indicating that the block is not transparent.
 76. The method of claim 71 further comprising: determining that the block is selected for retrieval based on: the information indicating that said block is transparent, the information indicating that the block is not occluded relative to a current viewing position, and additional information indicating that blocks of voxels neighboring said block are transparent.
 77. A medical imaging system for rendering a volume dataset, wherein the volume dataset includes a plurality of voxel blocks, wherein each of said voxel blocks includes two or more voxels, the system comprising: one or more rendering units; a first memory configured to store said plurality of voxel blocks; a control unit, wherein, for each of said plurality of voxel blocks, said control unit is configured to: identify, by performing a forward projection, a portion of a frame buffer corresponding to the voxel block; determine whether the voxel block is selected for transfer from said first memory to the one or more rendering units, wherein said determination is based upon whether said voxel block is transparent and whether said voxel block is occluded relative to a current viewing position; and transfer the voxel block from the first memory to said one or more rendering units in response to said determination indicating that the voxel block is selected for transfer; wherein, for each voxel block, said one or more rendering units are configured to process, in front-to-back order, a set of rays passing through the corresponding portion of the frame buffer, and wherein the one or more rendering units are configured to terminate processing of rays determined to be occluded.
 78. The medical imaging system of claim 77, wherein the volume dataset is a medical information dataset.
 79. The medical imaging system of claim 77, wherein said determination includes determining that the voxel block is not selected for transfer based on information indicating that the voxel block is occluded relative to the current viewing position.
 80. The medical imaging system of claim 77, wherein said determination includes determining that the voxel block is selected for transfer based on information indicating that the voxel block is not occluded relative to the current viewing position and information indicating that the voxel block is not transparent.
 81. The medical imaging system of claim 77, wherein said determination includes determining that the voxel block is selected for transfer based on: information indicating that said voxel block is transparent, information indicating that the voxel block is not occluded relative to a current viewing position, and information indicating that neighboring voxel blocks of said voxel block are transparent.
 82. A system for rendering a volume dataset, wherein the volume dataset includes a plurality of voxel blocks, wherein each of said voxel blocks includes an array of voxels, the system comprising: a plurality of rendering units; a first memory configured to store said plurality of voxel blocks; a control unit, wherein, for each of said plurality of voxel blocks, said control unit is configured to: identify, by performing a forward projection, a portion of a frame buffer corresponding to the voxel block; determine whether the voxel block is selected for transfer from said first memory to at least one of the plurality of rendering units, wherein said determination is based upon information regarding whether said voxel block is transparent and information regarding whether said voxel block is occluded relative to a current viewing position; and transfer the voxel block from the first memory to said at least one rendering unit in response to said determination indicating that the voxel block is selected for transfer; wherein, for each voxel block, said at least one rendering unit is configured to process, in front-to-back order, a set of rays passing through the corresponding portion of the frame buffer, and wherein said at least one rendering unit is configured to perform early ray termination on rays determined to be occluded.
 83. The system of claim 82, wherein the control unit is configured to perform said identification of the portion of the frame buffer corresponding to each of said voxel blocks according to a front-to-back ordering of the voxel blocks.
 84. The system of claim 82, wherein the at least one rendering unit is configured to determine that a ray is occluded by comparing an opacity value of the ray to an opacity threshold.
 85. The system of claim 82, wherein the at least one rendering unit is configured to perform space leaping on at least one of the rays of the set of rays in response to an indication that a current one of the voxel blocks is transparent.
 86. The system of claim 82, wherein the at least one rendering unit is configured to interpolate samples along one or more of the rays of said set of rays based on voxels of the transferred voxel block.
 87. The system of claim 82, wherein the array of voxels is a rectangular array.
 88. The system of claim 82, wherein the array of voxels is a cubic array.
 89. The system of claim 82, wherein said determination includes determining that the voxel block is not selected for transfer based on information indicating that the voxel block is occluded relative to the current viewing position.
 90. The system of claim 82, wherein said determination includes determining that the voxel block is selected for transfer based on information indicating that the voxel block is not occluded relative to the current viewing position and information indicating that the voxel block is not transparent.
 91. The system of claim 82, wherein said determination includes determining that the voxel block is selected for transfer based on: information indicating that said voxel block is transparent, information indicating that the voxel block is not occluded relative to a current viewing position, and information indicating that neighboring voxel blocks of said voxel block are transparent.
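
By way of illustration only, and without limiting the claims, the block selection and early ray termination recited above may be sketched in software as follows. This is a minimal sketch under stated assumptions, not the claimed hardware: the names `BlockInfo`, `select_for_transfer`, `composite_ray` and the threshold value 0.95 are hypothetical, and the accumulation rule is the conventional front-to-back "over" compositing operator, which the claims do not themselves specify. The case of a transparent block whose neighboring blocks are also transparent (space leaping) is deliberately not modeled.

```python
from dataclasses import dataclass

# Hypothetical value; the claims recite "an opacity threshold" without a number.
OPACITY_THRESHOLD = 0.95


@dataclass
class BlockInfo:
    """Per-block flags. Field names are illustrative, not taken from the claims."""
    transparent: bool
    occluded: bool


def select_for_transfer(info: BlockInfo) -> bool:
    """Select a block only when it is neither occluded nor transparent."""
    if info.occluded:
        return False  # occluded relative to the current viewing position
    return not info.transparent


def composite_ray(samples, threshold=OPACITY_THRESHOLD):
    """Composite (color, alpha) samples in front-to-back order.

    Processing of the ray stops once the accumulated opacity reaches the
    threshold, i.e. the ray is treated as occluded (early ray termination).
    Returns (color, opacity, samples_used).
    """
    color = 0.0
    opacity = 0.0
    used = 0
    for sample_color, sample_alpha in samples:
        if opacity >= threshold:
            break  # remaining samples cannot contribute visibly
        weight = (1.0 - opacity) * sample_alpha
        color += weight * sample_color
        opacity += weight
        used += 1
    return color, opacity, used


if __name__ == "__main__":
    ray = [(1.0, 0.5), (0.5, 1.0), (0.2, 0.9)]
    print(composite_ray(ray))  # third sample is skipped
```

In this sketch the third sample is never processed because the second sample already drives the accumulated opacity to 1.0, which illustrates why terminating occluded rays reduces the number of voxels that must be processed.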